ReBNN: Resilient Binary Neural Network


FIGURE 3.29
The evolution of the latent weight distribution of (a) ReActNet and (b) ReBNN. We select the first channel of the first binary convolution layer to show the evolution. The model is initialized from the first-stage training with W32A1 following [158]. We plot the distribution every 32 epochs.
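As a rough illustration of how such a panel of histograms can be produced, the following is a minimal NumPy/Matplotlib sketch, assuming the latent weights of the selected channel have been saved to hypothetical .npy files every 32 epochs; the file names and snapshot mechanism are illustrative, not part of the original training code.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical snapshots: the latent weights of the first channel of the
# first binary convolution layer, dumped to disk every 32 epochs.
epochs = range(0, 257, 32)  # Initial, 32, 64, ..., 256
snapshots = {e: np.load(f"latent_w_epoch{e}.npy").ravel() for e in epochs}

fig, axes = plt.subplots(1, len(snapshots), sharey=True,
                         figsize=(2.2 * len(snapshots), 2.2))
for ax, (epoch, w) in zip(axes, sorted(snapshots.items())):
    ax.hist(w, bins=50, density=True)  # one histogram per snapshot
    ax.set_title("Initial" if epoch == 0 else f"{epoch}")
axes[0].set_ylabel("Density")
fig.tight_layout()
plt.show()
```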

sign flip, thus hindering the training. Inspired by this, we use Eq. (3.150) to calculate γ and improve performance by 0.6%, showing that accounting for the proportion of weight oscillation permits the necessary sign flips and leads to more effective training. We also show the training loss curves in Fig. 3.30(b). As plotted, the loss curves largely reflect how sufficiently each variant trains. We therefore conclude that ReBNN with γ calculated by Eq. (3.150) achieves the lowest training loss and an efficient training process. Note that the loss may not be minimal at every training iteration; our method is simply a reasonable variant of gradient descent that can be applied to the optimization problem as a general solver. We empirically verify ReBNN's capability to mitigate weight oscillation, leading to better convergence.
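Eq. (3.150) itself is defined earlier in the chapter; purely as a hedged illustration, here is a PyTorch-style sketch of an oscillation-aware γ, assuming (consistent with the baselines compared in Table 3.7) that the per-channel maximum gradient magnitude is scaled by the proportion of weights whose sign flipped at the last update. The function name, tensor shapes, and the exact combination rule are assumptions, not the book's definition.

```python
import torch

def balanced_gamma(latent_w: torch.Tensor,
                   prev_sign: torch.Tensor,
                   grad: torch.Tensor) -> torch.Tensor:
    """Sketch of an oscillation-aware balanced parameter (assumed form).

    latent_w:  latent weights of one layer, shape (C_out, C_in, kH, kW)
    prev_sign: sign of the latent weights at the previous iteration
    grad:      gradient of the loss w.r.t. the binarized weights

    Returns one gamma per output channel, shape (C_out,).
    """
    c_out = latent_w.shape[0]
    w = latent_w.reshape(c_out, -1)
    g = grad.reshape(c_out, -1)
    prev = prev_sign.reshape(c_out, -1)

    # Proportion of weights in each channel whose sign flipped since the
    # last update (the "weight oscillation" proportion).
    flipped = (torch.sign(w) != prev).float().mean(dim=1)

    # Per-channel maximum gradient magnitude, matching the
    # max_{1<=j<=M_n}(|dL/dw_hat|) baseline row of Table 3.7.
    max_grad = g.abs().max(dim=1).values

    # Assumed combination: channels that oscillate more get a larger gamma
    # and are damped more, while stable channels stay free to flip signs.
    return flipped * max_grad
```

Under this assumed form, a channel with no oscillation yields γ = 0 and its weights remain free to flip, which matches the text's observation that considering the oscillation proportion still allows the necessary sign flips.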

Resilient training process: This section shows the evolution of the latent weight distribution. We plot the distribution of the first binary convolution layer's first channel every 32 epochs in Fig. 3.29. As seen, our ReBNN efficiently redistributes the BNN weights toward resilience. The conventional ReActNet [158] possesses a tri-modal distribution, which is unstable due to scaling factors with large magnitudes. In contrast, our ReBNN is constrained by the balanced parameter γ during training, leading to a resilient bi-modal distribution with fewer weights centered around zero. We also plot the ratios of sequential weight oscillation of ReBNN and ReActNet for the 1st, 8th, and 16th binary convolution layers

TABLE 3.7
We compare different calculation methods of γ, including constants that vary from 0 to 1e-2 and the gradient-based calculation.

Value of γ                                                            Top-1 (%)  Top-5 (%)
0                                                                     65.8       86.3
1e-5                                                                  66.2       86.7
1e-4                                                                  66.4       86.7
1e-3                                                                  66.3       86.8
1e-2                                                                  65.9       86.5
$\max_{1\le j\le M_n}(|\partial\mathcal{L}/\partial\hat{w}_{i,j}^{n,t}|)$  66.3       86.2
Eq. (3.150)                                                           66.9       87.1
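The preceding paragraph mentions plotting the ratios of sequential weight oscillation for selected layers. As a minimal sketch under the natural definition (the fraction of latent weights whose sign flips between consecutive epochs), one could measure the ratio as follows; the checkpoint access pattern in the usage comment is illustrative.

```python
import torch

def oscillation_ratio(w_prev: torch.Tensor, w_curr: torch.Tensor) -> float:
    """Fraction of latent weights whose binary sign flipped between two
    consecutive epochs (the sequential weight oscillation ratio)."""
    flipped = torch.sign(w_prev) != torch.sign(w_curr)
    return flipped.float().mean().item()

# Illustrative usage: track a given binary convolution layer across epochs,
# where `checkpoints[t][name]` holds that layer's latent weights at epoch t.
# ratios = [oscillation_ratio(checkpoints[t - 1][name], checkpoints[t][name])
#           for t in range(1, num_epochs)]
```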